PyTorch Fundamentals

Reading time: ~50 minutes | Level: Intermediate-Advanced

The Silent Bug

The model trains without errors. The loss changes from epoch to epoch, so nothing looks broken. Then you notice that the curve never settles the way it should: it overshoots and oscillates instead of converging smoothly. The model is not being trained on the gradients you think it is.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

X = torch.randn(32, 10)
y = torch.randn(32, 1)

for epoch in range(5):
    pred = model(X)
    loss = criterion(pred, y)
    loss.backward()     # ADDS this epoch's gradient to whatever is already in .grad
    optimizer.step()    # weights updated using the accumulated .grad
    # optimizer.zero_grad() is missing!
    print(f"Epoch {epoch}: loss={loss.item():.4f}")

# No error is raised and the printed loss still changes every epoch.
# But from epoch 1 onward, each step() applies grad_0 + grad_1 + ... + grad_t,
# not the current gradient alone.

Every epoch, loss.backward() accumulates gradients on top of the previous epoch's gradients. The optimizer's step() uses the sum of all accumulated gradients -- a signal that grows with each epoch and corrupts the update direction. Without optimizer.zero_grad(), the model is not training with the current batch's gradient; it is training with the sum of every gradient since the start.

This is one of the most common PyTorch bugs. This lesson explains the mechanics of why it happens and gives you the mental model to never make it again.
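The accumulation can be reproduced without PyTorch. The following is a plain-Python sketch on a single weight, not PyTorch code: `grad_buffer` is a hypothetical stand-in for a parameter's `.grad`, and the `+=` mimics the fact that `.backward()` adds into it.

```python
# Plain-Python sketch of gradient accumulation on a single weight.
# Model: pred = w * x, loss = (pred - y)^2, so d(loss)/dw = 2 * (w*x - y) * x.
# `grad_buffer` is a hypothetical stand-in for `param.grad`.

def run(zero_grad: bool, steps: int = 3, lr: float = 0.1):
    w, x, y = 0.0, 1.0, 1.0
    grad_buffer = 0.0
    applied = []                             # the gradient each step actually used
    for _ in range(steps):
        if zero_grad:
            grad_buffer = 0.0                # what optimizer.zero_grad() does
        grad_buffer += 2 * (w * x - y) * x   # backward() ADDS into the buffer
        applied.append(grad_buffer)
        w -= lr * grad_buffer                # what optimizer.step() does
    return w, applied

w_ok, grads_ok = run(zero_grad=True)
w_bad, grads_bad = run(zero_grad=False)
print(grads_ok)    # a fresh gradient at every step
print(grads_bad)   # a running sum of all gradients so far
print(w_ok, w_bad)
```

In this toy setting the optimum is `w = 1.0`. After three steps the run that zeroes its gradients is still approaching it (about 0.488), while the run that does not has already overshot it (about 1.008).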

Why This Matters

PyTorch is the dominant framework for research and increasingly for production deep learning. Unlike high-level frameworks that hide the training loop, PyTorch exposes it -- which means you have full control and full responsibility. Understanding the internals is not optional for ML engineering:

Tensor internals determine memory usage, contiguity bugs, and GPU transfer costs
Autograd is the foundation of every gradient-based algorithm; debugging loss spikes requires understanding the computation graph
nn.Module is the building block of every architecture; hooks, buffers, and parameter registration enable techniques from distributed training to quantisation
DataLoader/Dataset performance directly affects GPU utilisation
Device management errors are among the most common runtime crashes

1. Tensor Internals: Storage, Strides, and Contiguity

A PyTorch tensor is not just an array of numbers. It is a view over a flat block of memory, described by four components: storage, dtype, shape, and strides.

import torch

# Create a tensor
x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
print(x)
# tensor([[ 0.,  1.,  2.,  3.],
#         [ 4.,  5.,  6.,  7.],
#         [ 8.,  9., 10., 11.]])

# Storage: the flat underlying data
print(x.storage().tolist())   # [0.0, 1.0, 2.0, ..., 11.0]  (newer releases warn that storage() is deprecated in favour of untyped_storage())

# Strides: how many storage elements to skip to advance by 1 in each dimension
# For a 3x4 row-major tensor: stride=(4,1)
# To advance one row: skip 4 elements. To advance one column: skip 1 element.
print(x.stride())          # (4, 1)
print(x.storage_offset())  # 0  -- starts at the beginning of storage

# Transpose does NOT copy data -- it just swaps the strides
x_t = x.T
print(x_t.shape)    # torch.Size([4, 3])
print(x_t.stride()) # (1, 4)  -- now column-major

# is_contiguous checks whether the strides match C-order layout
print(x.is_contiguous())    # True
print(x_t.is_contiguous())  # False -- the transpose is a non-contiguous view

# Some operations (view() is the usual example) cannot work with x_t's strides and raise an error.
# Fix with .contiguous(), which creates a new, C-order copy:
x_t_cont = x_t.contiguous()
print(x_t_cont.stride())    # (3, 1)  -- now C-order for a 4x3 tensor

# Why this matters for ML:
# 1. view() needs strides compatible with the requested shape -- reshape() copies only when it has to
# 2. Kernels are generally written for contiguous data; non-contiguous inputs may trigger a hidden copy or a slower path
# 3. Some CUDA kernels require contiguous inputs (e.g. certain cuDNN operations)
try:
    x_t.view(-1)
except RuntimeError as e:
    print(e)   # "view size is not compatible with input tensor's size and stride"

x_t.reshape(-1)   # works -- internally calls contiguous() if needed
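The index arithmetic behind strides is small enough to write out by hand. The following plain-Python sketch (no PyTorch involved) uses hypothetical helpers `storage_index`, `c_order_strides` and `is_contiguous` to mirror the 3x4 example above. It is a simplification: the real contiguity check also ignores the strides of size-1 dimensions.

```python
# Plain-Python sketch of how strides and offset map an index to flat storage.
from typing import Sequence

def storage_index(index: Sequence[int], strides: Sequence[int], offset: int = 0) -> int:
    """Position in flat storage of the element at `index`."""
    return offset + sum(i * s for i, s in zip(index, strides))

def c_order_strides(shape: Sequence[int]) -> tuple:
    """Strides a freshly allocated row-major tensor of this shape would have."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))

def is_contiguous(shape: Sequence[int], strides: Sequence[int]) -> bool:
    return tuple(strides) == c_order_strides(shape)

storage = list(range(12))            # the flat data behind the 3x4 tensor x
x_shape, x_strides = (3, 4), (4, 1)
t_shape, t_strides = (4, 3), (1, 4)  # transpose: same storage, swapped strides

print(storage[storage_index((1, 2), x_strides)])   # x[1, 2]   -> 6
print(storage[storage_index((2, 1), t_strides)])   # x.T[2, 1] -> 6
print(is_contiguous(x_shape, x_strides))           # True
print(is_contiguous(t_shape, t_strides))           # False
print(c_order_strides(t_shape))                    # (3, 1)
```

Both lookups land on the same storage slot, which is why a transpose needs no copy: only the strides change.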

Memory sharing: slices and transposes share the same underlying storage. Writing to a slice modifies the original tensor.

a = torch.ones(4)
b = a[:2]       # b is a VIEW of a's storage
b[0] = 99.0
print(a)        # tensor([99., 1., 1., 1.])  -- a was modified!

# Use .clone() to get an independent copy
c = a[:2].clone()
c[0] = 0.0
print(a)        # tensor([99., 1., 1., 1.])  -- unchanged by the write to c

2. Autograd: The Computation Graph

PyTorch builds a dynamic computation graph as operations execute. Every tensor that has requires_grad=True records its creation operation and its inputs. When you call .backward(), PyTorch traverses this graph in reverse, applying the chain rule to accumulate gradients.

import torch

# Leaf tensors -- created by the user, not by an operation
x = torch.tensor([2.0, 3.0], requires_grad=True)

# Operations on x create non-leaf tensors with grad_fn
y = x * 2          # y.grad_fn = MulBackward0
z = y + 1          # z.grad_fn = AddBackward0
loss = z.mean()    # loss.grad_fn = MeanBackward0

print(x.is_leaf)    # True  -- created by user, has no grad_fn
print(y.is_leaf)    # False -- created by an operation
print(loss.grad_fn) # <MeanBackward0 object>

# Backprop: computes d(loss)/d(x)
# loss = mean(x*2 + 1) = (2*x1 + 1 + 2*x2 + 1) / 2
# d(loss)/d(xi) = 2/2 = 1.0 for each element
loss.backward()
print(x.grad)       # tensor([1., 1.])

# --- retain_graph ---
# By default, the computation graph is freed after .backward().
# If you need to backprop through the same graph twice (e.g. in MAML,
# or when computing higher-order derivatives), use retain_graph=True:
x2 = torch.tensor([1.0], requires_grad=True)
y2 = x2 ** 2
y2.backward(retain_graph=True)   # x2.grad = tensor([2.])
x2.grad.zero_()                  # clear gradient before second backward
y2.backward()                    # works because the first call passed retain_graph=True
print(x2.grad)                   # tensor([2.])

# --- detach ---
# Creates a tensor that shares storage but is not in the computation graph.
# Use for: stopping gradient flow, creating targets, computing metrics.
detached = y2.detach()         # no context manager needed
print(detached.requires_grad)  # False

# --- torch.no_grad ---
# Alternatively, disable graph construction for a whole block.
# model, X_test and y_test below are small placeholders so the snippet runs on its own.
model = torch.nn.Linear(10, 3)
X_test = torch.randn(16, 10)
y_test = torch.randint(0, 3, (16,))
with torch.no_grad():
    # No graph is built here -- faster and memory efficient for inference
    pred = model(X_test)
    accuracy = (pred.argmax(dim=1) == y_test).float().mean()
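To make the graph traversal concrete, here is a toy scalar reverse-mode autodiff in plain Python. It is a sketch for intuition only, not how PyTorch is implemented: the `Value` class and its attributes are hypothetical, it supports only multiplication by a constant and addition, and it is meant for a single backward call.

```python
# Minimal scalar reverse-mode autodiff, a toy stand-in for autograd.
class Value:
    def __init__(self, data, parents=(), backward_fn=None):
        self.data = data
        self.grad = 0.0                 # accumulated with +=, like tensor.grad
        self.parents = parents          # inputs of the op that created this node
        self.backward_fn = backward_fn  # None for leaves (compare grad_fn)

    def __mul__(self, k):               # multiply by a Python constant
        out = Value(self.data * k, (self,))
        def fn():
            self.grad += k * out.grad
        out.backward_fn = fn
        return out

    def __add__(self, other):           # add a Value or a Python constant
        if isinstance(other, Value):
            out = Value(self.data + other.data, (self, other))
            def fn():
                self.grad += out.grad
                other.grad += out.grad
        else:
            out = Value(self.data + other, (self,))
            def fn():
                self.grad += out.grad
        out.backward_fn = fn
        return out

    def backward(self):
        # Build a topological order, then apply the chain rule output-first.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node.parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()

# Same computation as above: loss = mean(x * 2 + 1) for x = [2, 3]
x1, x2 = Value(2.0), Value(3.0)
loss = ((x1 * 2 + 1) + (x2 * 2 + 1)) * 0.5
loss.backward()
print(loss.data, x1.grad, x2.grad)   # 6.0 1.0 1.0
```

The leaves end up with gradient 1.0 each, matching `x.grad` in the PyTorch example. Note that every backward function uses `+=`, which is the same design choice that makes `zero_grad()` necessary.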

Gradient accumulation: because .backward() adds to .grad rather than replacing it, you must call optimizer.zero_grad() before each backward pass. This is also the basis for intentional gradient accumulation (simulating large batch sizes):

ACCUMULATION_STEPS = 4   # simulate batch size 4x larger

optimizer.zero_grad()
for i, (X_batch, y_batch) in enumerate(loader):
    pred = model(X_batch)
    loss = criterion(pred, y_batch) / ACCUMULATION_STEPS   # scale loss
    loss.backward()   # accumulate gradients

    if (i + 1) % ACCUMULATION_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()   # flush accumulated gradients
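Why divide the loss by ACCUMULATION_STEPS? A quick plain-Python check, assuming a mean-reduced loss and micro-batches of equal size: the scaled micro-batch gradients then sum to exactly the full-batch gradient. The helper `mse_grad` and the data are made up for illustration.

```python
# Plain-Python check: 4 micro-batches with loss / 4 give the same gradient
# as one large batch, for a mean-reduced loss and equal micro-batch sizes.
def mse_grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

ACCUMULATION_STEPS = 4
micro = len(xs) // ACCUMULATION_STEPS          # micro-batch size of 2

accumulated = 0.0
for i in range(ACCUMULATION_STEPS):
    sl = slice(i * micro, (i + 1) * micro)
    accumulated += mse_grad(w, xs[sl], ys[sl]) / ACCUMULATION_STEPS

full_batch = mse_grad(w, xs, ys)
print(abs(accumulated - full_batch) < 1e-9)    # True
```

Without the division the accumulated gradient would be 4x too large, which acts like an unintended 4x increase in learning rate.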

3. nn.Module Anatomy

nn.Module is the base class for all neural network components. Understanding its internals lets you write custom layers, debug architecture bugs, and use advanced features like hooks.

import torch
import torch.nn as nn

class TwoLayerMLP(nn.Module):
    def __init__(self, in_features: int, hidden: int, out_features: int) -> None:
        super().__init__()
        # Attributes that are nn.Module subclasses are automatically registered
        # as submodules and their parameters are included in self.parameters()
        self.fc1   = nn.Linear(in_features, hidden)
        self.relu  = nn.ReLU()
        self.drop  = nn.Dropout(p=0.3)
        self.fc2   = nn.Linear(hidden, out_features)

        # nn.Parameter wraps a tensor and registers it as a learnable parameter
        # This is how you add non-standard learnable weights (e.g. in custom attention)
        self.scale = nn.Parameter(torch.ones(1))

        # register_buffer registers a tensor that is NOT a parameter (not learnable)
        # but IS part of the model state -- saved/loaded with state_dict.
        # Use for: running statistics (BatchNorm), positional embeddings, masks.
        self.register_buffer("bias_correction", torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.fc1(x)
        x = self.relu(x)
        x = self.drop(x)            # only active in training mode
        x = self.fc2(x)
        x = x * self.scale          # learnable scaling factor
        x = x + self.bias_correction  # buffer used in forward pass
        return x

model = TwoLayerMLP(10, 64, 4)

# Inspect parameters
for name, param in model.named_parameters():
    print(f"{name:25s}  shape={tuple(param.shape)}  requires_grad={param.requires_grad}")
# fc1.weight                  shape=(64, 10)  requires_grad=True
# fc1.bias                    shape=(64,)     requires_grad=True
# fc2.weight                  shape=(4, 64)   requires_grad=True
# fc2.bias                    shape=(4,)      requires_grad=True
# scale                       shape=(1,)      requires_grad=True

# Inspect buffers
for name, buf in model.named_buffers():
    print(f"{name}  shape={tuple(buf.shape)}  requires_grad={buf.requires_grad}")
# bias_correction  shape=(4,)  requires_grad=False

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable    = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total: {total_params}, Trainable: {trainable}")
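The parameter count can be checked by hand. The sketch below uses a hypothetical helper `linear_params` to reproduce the arithmetic for TwoLayerMLP(10, 64, 4); the `bias_correction` buffer is deliberately left out because buffers are not parameters.

```python
def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Weight matrix (out x in) plus optional bias vector (out)."""
    return in_features * out_features + (out_features if bias else 0)

in_features, hidden, out_features = 10, 64, 4
fc1 = linear_params(in_features, hidden)      # 640 + 64
fc2 = linear_params(hidden, out_features)     # 256 + 4
scale = 1                                     # the extra nn.Parameter
print(fc1, fc2, fc1 + fc2 + scale)            # 704 260 965
```

If the number printed by the model differs from a hand count like this, a layer is usually either missing from the module tree or has been frozen.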

Forward Hooks for Debugging

Hooks let you inspect or modify tensor values during the forward or backward pass without modifying the module.

import torch
import torch.nn as nn

model = TwoLayerMLP(10, 64, 4)

# A forward hook receives (module, input_tuple, output_tensor)
activations = {}

def capture_activation(module, input, output):
    # Detach to avoid holding the computation graph in memory
    activations[module] = output.detach().cpu()

# Register hook on a specific layer
hook = model.fc1.register_forward_hook(capture_activation)

# Run a forward pass
x = torch.randn(8, 10)
_ = model(x)

print(activations[model.fc1].shape)   # (8, 64)
print(activations[model.fc1].mean())  # mean activation post-fc1

# Always remove hooks when done -- they persist and affect every forward pass
hook.remove()

# Gradient hooks for debugging vanishing/exploding gradients
def log_grad_norm(module, grad_input, grad_output):
    for g in grad_output:
        if g is not None:
            print(f"{module.__class__.__name__} grad norm: {g.norm().item():.4f}")

hook_bwd = model.fc1.register_full_backward_hook(log_grad_norm)
# ... run forward + backward ...
hook_bwd.remove()
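The reason hooks persist until removed becomes clearer with a toy version of the mechanism. This is a plain-Python sketch under simplifying assumptions, not PyTorch's implementation: `ToyModule` and `Handle` are hypothetical names, and only forward hooks are modelled.

```python
# Toy sketch of forward hooks with removable handles.
class Handle:
    def __init__(self, hooks: dict, key: int):
        self._hooks, self._key = hooks, key

    def remove(self):
        self._hooks.pop(self._key, None)

class ToyModule:
    def __init__(self):
        self._forward_hooks = {}
        self._next_key = 0

    def register_forward_hook(self, fn):
        key = self._next_key
        self._next_key += 1
        self._forward_hooks[key] = fn
        return Handle(self._forward_hooks, key)

    def forward(self, x):
        raise NotImplementedError

    def __call__(self, x):
        out = self.forward(x)
        for fn in list(self._forward_hooks.values()):
            result = fn(self, (x,), out)   # same (module, input, output) signature
            if result is not None:         # a hook may replace the output
                out = result
        return out

class Double(ToyModule):
    def forward(self, x):
        return 2 * x

seen = []
layer = Double()
handle = layer.register_forward_hook(lambda m, inp, out: seen.append(out))
layer(3)          # hook fires
handle.remove()
layer(5)          # hook no longer fires
print(seen)       # [6]
```

The hook lives in a dictionary owned by the module, so it runs on every call until its handle removes it. This is also why hooks should be invoked through `model(x)` rather than `model.forward(x)`.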

4. The Training Loop

The canonical PyTorch training loop, written defensively:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from typing import Callable

def train_epoch(
    model: nn.Module,
    loader: DataLoader,
    criterion: Callable,
    optimizer: torch.optim.Optimizer,
    device: torch.device,
    grad_clip: float | None = None,
) -> float:
    """Runs one training epoch. Returns mean loss."""
    model.train()   # enables Dropout, BatchNorm running stats update
    total_loss = 0.0

    for X_batch, y_batch in loader:
        X_batch = X_batch.to(device, non_blocking=True)   # async transfer
        y_batch = y_batch.to(device, non_blocking=True)

        # 1. Zero gradients BEFORE the forward pass
        optimizer.zero_grad()

        # 2. Forward pass
        logits = model(X_batch)
        loss   = criterion(logits, y_batch)

        # 3. Backward pass
        loss.backward()

        # 4. Optional gradient clipping (prevents gradient explosion in RNNs/Transformers)
        if grad_clip is not None:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_clip)

        # 5. Parameter update
        optimizer.step()

        total_loss += loss.item()   # .item() extracts scalar, detaches from graph

    return total_loss / len(loader)


@torch.no_grad()   # disables gradient tracking for the entire function
def evaluate(
    model: nn.Module,
    loader: DataLoader,
    criterion: Callable,
    device: torch.device,
) -> tuple[float, float]:
    """Returns (mean_loss, accuracy)."""
    model.eval()    # disables Dropout, uses running stats for BatchNorm
    total_loss = 0.0
    correct    = 0
    total      = 0

    for X_batch, y_batch in loader:
        X_batch = X_batch.to(device, non_blocking=True)
        y_batch = y_batch.to(device, non_blocking=True)

        logits = model(X_batch)
        loss   = criterion(logits, y_batch)

        total_loss += loss.item()
        preds       = logits.argmax(dim=1)
        correct    += (preds == y_batch).sum().item()
        total      += y_batch.size(0)

    return total_loss / len(loader), correct / total


def train(
    model: nn.Module,
    train_loader: DataLoader,
    val_loader: DataLoader,
    n_epochs: int = 50,
    lr: float = 1e-3,
    device: torch.device | None = None,
    patience: int = 5,
) -> dict:
    """Full training loop with early stopping."""
    if device is None:
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-4)
    criterion = nn.CrossEntropyLoss()
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_epochs)

    history = {"train_loss": [], "val_loss": [], "val_acc": []}
    best_val_loss = float("inf")
    no_improve    = 0
    best_state    = None

    for epoch in range(1, n_epochs + 1):
        train_loss            = train_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc     = evaluate(model, val_loader, criterion, device)
        scheduler.step()

        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["val_acc"].append(val_acc)

        print(f"Epoch {epoch:3d}/{n_epochs}  "
              f"train_loss={train_loss:.4f}  val_loss={val_loss:.4f}  val_acc={val_acc:.4f}")

        # Early stopping
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            no_improve    = 0
            # Save the best weights (deep copy via state_dict)
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f"Early stopping at epoch {epoch}")
                break

    # Restore best weights before returning
    if best_state is not None:
        model.load_state_dict(best_state)

    return history
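The early-stopping bookkeeping in train() can be isolated and tested on its own. The following is a plain-Python sketch with a hypothetical `EarlyStopper` class and made-up validation losses; it follows the same rule as above (reset the counter on improvement, stop after `patience` epochs without one).

```python
class EarlyStopper:
    def __init__(self, patience: int):
        self.patience = patience
        self.best = float("inf")
        self.best_epoch = None
        self.no_improve = 0

    def update(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best, self.best_epoch, self.no_improve = val_loss, epoch, 0
            return False
        self.no_improve += 1
        return self.no_improve >= self.patience

val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64, 0.67, 0.69, 0.70]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, v in enumerate(val_losses, start=1):
    if stopper.update(epoch, v):
        stopped_at = epoch
        break
print(stopper.best_epoch, stopper.best, stopped_at)   # 6 0.64 9
```

The two bad epochs after epoch 3 do not trigger a stop because epoch 6 improves and resets the counter. Training ends at epoch 9, and the weights to restore are those from epoch 6.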

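The five steps, in order: (1) optimizer.zero_grad(), (2) forward pass, (3) loss.backward(), (4) gradient clipping (optional), (5) optimizer.step(). The one ordering choice that varies between implementations is whether zero_grad is called at the top of the iteration or right after step; both are correct as long as gradients are cleared between one backward pass and the next. Calling it at the top is preferred because it makes the mental model cleaner: every iteration starts from empty gradients.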
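Step 4 clips by the global norm: all gradients are treated as one long vector and rescaled together, so the update direction is preserved. A plain-Python sketch of that arithmetic follows. `clip_by_global_norm` is a hypothetical helper, and the small `eps` in the denominator is an assumption made here for numerical safety.

```python
import math

def clip_by_global_norm(grads, max_norm: float, eps: float = 1e-6):
    """Rescale all gradients so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    coef = min(1.0, max_norm / (total_norm + eps))   # never scale up
    return [[g * coef for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]              # two "parameters", global norm 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before)            # 5.0
print(round(norm_after, 4))   # 1.0
```

Clipping must come after backward() and before step(), as in train_epoch above. Clipping before backward() would act on stale or empty gradients.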

5. Dataset and DataLoader

import torch
from torch.utils.data import Dataset, DataLoader
import numpy as np

class TabularDataset(Dataset):
    """
    Wraps numpy arrays as a PyTorch Dataset.

    Dataset.__len__ and Dataset.__getitem__ are the only required methods.
    PyTorch's DataLoader uses these to construct batches.
    """

    def __init__(self, X: np.ndarray, y: np.ndarray) -> None:
        # Convert once at construction, not per-item -- much faster
        self.X = torch.from_numpy(X).float()
        self.y = torch.from_numpy(y).long()

    def __len__(self) -> int:
        return len(self.X)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        return self.X[idx], self.y[idx]


# DataLoader handles batching, shuffling, and multi-process loading
train_dataset = TabularDataset(X_train_np, y_train_np)
val_dataset   = TabularDataset(X_val_np,   y_val_np)

train_loader = DataLoader(
    train_dataset,
    batch_size=64,
    shuffle=True,         # shuffle training data each epoch
    num_workers=4,        # parallel data loading in background processes
    pin_memory=True,      # page-lock host memory for faster GPU transfer
    drop_last=True,       # drop the last incomplete batch (stabilises BatchNorm)
    persistent_workers=True,  # keep workers alive between epochs (reduces spawn cost)
)

val_loader = DataLoader(
    val_dataset,
    batch_size=128,       # larger batch is fine for evaluation (no gradients)
    shuffle=False,        # never shuffle validation data
    num_workers=2,
    pin_memory=True,
)

# Inspect a batch
X_batch, y_batch = next(iter(train_loader))
print(X_batch.shape, y_batch.shape)

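num_workers tuning: start with num_workers=4 and increase until GPU utilisation stops improving. On macOS and Windows, worker processes are started with spawn rather than fork, so the dataset must be picklable and script-level code must sit under an if __name__ == "__main__": guard; if workers fail to start, num_workers=0 is the safe fallback (persistent_workers=True must then be removed, since it requires at least one worker). On Colab/Kaggle, where only a few CPU cores are available, 2 is usually a good starting point.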
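What batch_size, shuffle and drop_last do to the index stream can be sketched in plain Python. This is an illustration of the batching arithmetic only, with hypothetical helpers `num_batches` and `iter_batches`; it does not model workers, collation or pinned memory.

```python
import math
import random

def num_batches(n: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches one pass over n samples produces."""
    return n // batch_size if drop_last else math.ceil(n / batch_size)

def iter_batches(n, batch_size, shuffle, drop_last, seed=0):
    """Yield lists of dataset indices, one list per batch."""
    indices = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n, batch_size):
        batch = indices[start:start + batch_size]
        if drop_last and len(batch) < batch_size:
            return
        yield batch

print(num_batches(1000, 64, drop_last=True))    # 15
print(num_batches(1000, 64, drop_last=False))   # 16
sizes = [len(b) for b in iter_batches(1000, 64, shuffle=True, drop_last=False)]
print(sizes[-1])                                # 40
```

With 1000 samples and batch_size=64 the final batch holds only 40 samples. That smaller batch is what drop_last=True discards.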

6. GPU Device Management

import torch
import torch.nn as nn

# --- Device selection ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# On Apple Silicon:
# device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

print(f"Using device: {device}")
if device.type == "cuda":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# --- Moving tensors and models ---
model = TwoLayerMLP(10, 128, 4)
model = model.to(device)   # moves all parameters and buffers

# Move tensors
x = torch.randn(32, 10).to(device)          # using .to(device) -- preferred
x_gpu = torch.randn(32, 10).cuda()          # older API -- hardcodes CUDA

# A common bug: mixing CPU and GPU tensors
try:
    y_cpu = torch.randn(32, 10)
    result = x + y_cpu   # RuntimeError when x is on the GPU: "Expected all tensors to be on the same device, ..."
except RuntimeError as e:
    print(e)

# Always move tensors to the same device as the model before the forward pass

# --- non_blocking transfers ---
# Without non_blocking: the CPU waits for the GPU to complete the transfer
# With non_blocking=True on pinned memory: transfer happens asynchronously;
# the CPU can prepare the next batch while the GPU receives the current one
X_batch = X_batch.to(device, non_blocking=True)   # overlaps CPU/GPU work

# --- Checking and managing GPU memory ---
if device.type == "cuda":
    print(torch.cuda.memory_allocated() / 1e6, "MB allocated")
    print(torch.cuda.memory_reserved() / 1e6,  "MB reserved (cached)")
    torch.cuda.empty_cache()   # release cached memory back to OS (does not free allocated memory)

# --- Context manager for mixed precision ---
from torch.amp import autocast, GradScaler   # PyTorch 2.3+; torch.cuda.amp is the older, deprecated location

scaler = GradScaler("cuda")   # scales loss to prevent fp16 gradient underflow

for X_batch, y_batch in train_loader:
    X_batch = X_batch.to(device)
    y_batch = y_batch.to(device)
    optimizer.zero_grad()

    with autocast(device_type="cuda", dtype=torch.float16):    # runs forward pass in fp16 where safe, fp32 elsewhere
        logits = model(X_batch)
        loss   = criterion(logits, y_batch)

    scaler.scale(loss).backward()   # scale gradients to avoid underflow
    scaler.step(optimizer)          # unscale before optimizer.step()
    scaler.update()                 # adjust scale factor for next iteration

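Mixed precision training (AMP) runs most of the forward and backward pass in float16 while keeping the weights and the optimizer update in float32. It reduces activation memory (by up to roughly half) and is often reported to run 1.5-3x faster on GPUs with Tensor Cores; the actual gain depends on the model and on how much of the runtime is spent in matrix multiplications.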
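Why loss scaling is needed can be shown with the standard library alone, because `struct` supports IEEE half precision through the `"e"` format. The sketch below is an illustration of the underflow, not of GradScaler itself. The gradient value is made up, and the scale of 2**16 is chosen to match what is, as far as this sketch assumes, GradScaler's default initial scale.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8          # a small but meaningful gradient value (made up)
scale = 65536.0      # 2**16, assumed scale factor for this sketch

print(to_fp16(grad))                          # 0.0 -- underflows in fp16
scaled = to_fp16(grad * scale)                # representable after scaling
recovered = scaled / scale                    # unscale in higher precision
print(abs(recovered - grad) / grad < 1e-3)    # True
```

Scaling the loss scales every gradient by the same factor, which lifts small values into the range float16 can represent. Dividing by the same factor before the optimizer step restores their true magnitude.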

7. Model Saving and Loading

import torch
import torch.nn as nn
from pathlib import Path

# --- state_dict: the recommended approach ---
def save_checkpoint(
    model: nn.Module,
    optimizer: torch.optim.Optimizer,
    epoch: int,
    loss: float,
    path: str | Path,
) -> None:
    """
    Saves model weights + optimizer state + training metadata.

    Why save optimizer state? Optimizer has momentum/adaptive terms (Adam).
    Without it, resuming training resets these -- the first few resumed epochs
    behave differently from uninterrupted training.
    """
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    torch.save({
        "epoch":                epoch,
        "model_state_dict":     model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
        "loss":                 loss,
    }, path)
    print(f"Checkpoint saved: {path}")


def load_checkpoint(
    path: str | Path,
    model: nn.Module,
    optimizer: torch.optim.Optimizer | None = None,
    device: torch.device | None = None,
) -> dict:
    """
    Loads a checkpoint. Returns the metadata dict.

    map_location ensures tensors are loaded onto the target device,
    not whatever device they were saved from.
    """
    if device is None:
        device = torch.device("cpu")

    checkpoint = torch.load(path, map_location=device)
    model.load_state_dict(checkpoint["model_state_dict"])

    if optimizer is not None:
        optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

    return checkpoint


# --- Inference-only export ---
# For deployment, save only the state_dict (no optimizer state)
torch.save(model.state_dict(), "model_weights.pt")

# Load for inference
model = TwoLayerMLP(10, 128, 4)
model.load_state_dict(torch.load("model_weights.pt", map_location="cpu"))
model.eval()   # always set eval mode before inference

# --- TorchScript export (for C++ deployment or locked Python environments) ---
# torch.jit.script compiles the model's Python source, including control flow, into a static graph
# (torch.jit.trace is the variant that records one example run instead)
scripted = torch.jit.script(model)
scripted.save("model_scripted.pt")

loaded_script = torch.jit.load("model_scripted.pt")
with torch.no_grad():
    out = loaded_script(torch.randn(1, 10))

# --- ONNX export (for deployment across frameworks: TensorRT, ONNXRuntime) ---
dummy_input = torch.randn(1, 10)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    opset_version=17,
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch_size"}, "logits": {0: "batch_size"}},
)
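The save_checkpoint docstring claims that resuming without optimizer state changes the trajectory. A plain-Python sketch makes that visible, using SGD with momentum on the toy objective f(w) = w^2. The function `sgd_momentum` and all values are illustrative assumptions, not part of the checkpoint code above.

```python
# Plain-Python SGD with momentum on f(w) = w^2 (gradient 2w).
def sgd_momentum(w, velocity, steps, lr=0.1, momentum=0.9):
    for _ in range(steps):
        grad = 2 * w
        velocity = momentum * velocity + grad   # the optimizer's internal state
        w = w - lr * velocity
    return w, velocity

# Uninterrupted run: 10 steps.
w_full, _ = sgd_momentum(w=1.0, velocity=0.0, steps=10)

# Interrupted after 5 steps, then resumed two ways.
w_mid, v_mid = sgd_momentum(w=1.0, velocity=0.0, steps=5)
w_with_state, _ = sgd_momentum(w_mid, v_mid, steps=5)   # optimizer state restored
w_no_state, _ = sgd_momentum(w_mid, 0.0, steps=5)       # weights only

print(w_with_state == w_full)            # True
print(abs(w_no_state - w_full) > 1e-6)   # True
```

Restoring the velocity reproduces the uninterrupted run exactly. Restoring only the weights sends training down a different path, even though the weights at the moment of resumption are identical.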

8. Common Bugs Catalogue

Bug 1: Forgetting zero_grad (opening scenario)

# BAD
for X, y in loader:
    pred = model(X)
    loss = criterion(pred, y)
    loss.backward()
    optimizer.step()   # gradients accumulate every step!

# GOOD
for X, y in loader:
    optimizer.zero_grad()
    pred = model(X)
    loss = criterion(pred, y)
    loss.backward()
    optimizer.step()

Bug 2: In-place operations breaking autograd

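# In-place operations modify a tensor's data without creating a new tensor.
# If an in-place op is applied to a tensor needed by the backward pass,
# autograd cannot compute the correct gradient and raises an error.

x = torch.randn(4, requires_grad=True)

# BAD (1): in-place op directly on a leaf tensor that requires grad
# x += 1   # RuntimeError about an in-place operation on a leaf Variable that requires grad

# BAD (2): in-place op on a tensor that the backward pass needs
y = x.exp()            # exp saves its output y to compute the gradient later
y += 1                 # in-place (y.__iadd__): overwrites the saved value, no error yet
# y.sum().backward()   # RuntimeError: one of the variables needed for gradient computation
#                      # has been modified by an inplace operation

# GOOD: create a new tensor
y = x.exp()
y = y + 1   # out-of-place; y is now a new tensor and the saved value is intact
y.sum().backward()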
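How can autograd tell that a tensor it needs was modified in place? PyTorch tracks this with a per-tensor version counter. The following is a toy plain-Python model of that idea, with hypothetical classes `ToyTensor` and `SavedForBackward`; it is a sketch of the check, not of PyTorch internals.

```python
# Toy model of a saved-tensor version check.
class ToyTensor:
    def __init__(self, value: float):
        self.value = value
        self.version = 0              # bumped by every in-place write

    def add_(self, k: float):         # in-place: same object, new version
        self.value += k
        self.version += 1
        return self

    def add(self, k: float):          # out-of-place: a new object
        return ToyTensor(self.value + k)

class SavedForBackward:
    def __init__(self, tensor: ToyTensor):
        self.tensor = tensor
        self.expected_version = tensor.version   # remembered at forward time

    def unpack(self) -> ToyTensor:
        if self.tensor.version != self.expected_version:
            raise RuntimeError("saved tensor was modified by an in-place operation")
        return self.tensor

y = ToyTensor(2.0)
saved = SavedForBackward(y)   # the forward pass keeps y for later

z = y.add(1.0)                # out-of-place: the saved tensor is untouched
print(saved.unpack().value)   # 2.0

y.add_(1.0)                   # in-place: the saved value is now stale
try:
    saved.unpack()
except RuntimeError as e:
    print(e)                  # saved tensor was modified by an in-place operation
```

This also explains the timing of the real error: the in-place write itself succeeds, and the failure appears only later, when the backward pass asks for the saved value.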

Bug 3: Calling model.eval() but forgetting torch.no_grad()

# model.eval() disables Dropout and uses BatchNorm running stats -- good.
# But it does NOT stop the computation graph from being built.
# Without torch.no_grad(), autograd still tracks every operation during inference.

model.eval()

# BAD: torch.no_grad() is missing, so a graph is built and memory is wasted
outputs = model(X_test)   # builds a graph nobody will use

# GOOD
model.eval()
with torch.no_grad():
    outputs = model(X_test)   # no graph built -- faster, uses less memory

Bug 4: Calling .item() inside the loss accumulation loop

# .item() is cheap but it synchronises the CPU and GPU.
# Calling it on every batch in a tight training loop causes GPU pipeline stalls.

# ACCEPTABLE for debugging
for batch in loader:
    ...
    print(loss.item())   # forces GPU sync on every batch

# BETTER: accumulate tensor losses, sync only once per epoch
epoch_loss = torch.tensor(0.0, device=device)
for batch in loader:
    ...
    epoch_loss += loss.detach()   # detach to avoid holding graph

avg_loss = (epoch_loss / len(loader)).item()   # single sync at end of epoch

Bug 5: model.train() / model.eval() in the wrong place

# BatchNorm and Dropout behave differently in train vs eval mode.
# Forgetting to switch causes:
# - Dropout active during evaluation: non-deterministic, lower accuracy
# - BatchNorm uses batch stats instead of running stats during inference:
#   predictions change with batch size

# ALWAYS:
model.train()    # at the start of each training epoch
model.eval()     # at the start of evaluation and inference
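The mode-dependent behaviour can be sketched in plain Python. The helpers `dropout` and `update_running_mean` below are illustrative assumptions, not PyTorch code: the first follows the usual inverted-dropout formulation, the second the exponential moving average used for running statistics, with a momentum of 0.1.

```python
import random

def dropout(xs, p: float, training: bool, rng: random.Random):
    """Inverted dropout: zero with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(xs)               # eval mode: identity
    keep = 1.0 - p
    return [x / keep if rng.random() < keep else 0.0 for x in xs]

def update_running_mean(running: float, batch_mean: float, momentum: float = 0.1) -> float:
    """Exponential moving average, updated only in training mode."""
    return (1.0 - momentum) * running + momentum * batch_mean

rng = random.Random(0)
xs = [1.0] * 10_000
train_out = dropout(xs, p=0.3, training=True, rng=rng)
eval_out = dropout(xs, p=0.3, training=False, rng=rng)

print(eval_out == xs)                                     # True: deterministic
print(abs(sum(train_out) / len(train_out) - 1.0) < 0.1)   # True: mean kept in expectation
print(update_running_mean(0.0, 2.0))                      # 0.2
```

Leaving the training branch active during evaluation makes predictions random. Leaving the moving-average update active makes every evaluation batch nudge the stored statistics.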

9. A Complete Minimal Example

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import numpy as np

# --- Data ---
rng = np.random.default_rng(42)
n = 1000
X_np = rng.normal(size=(n, 20)).astype(np.float32)
w    = rng.normal(size=(20,)).astype(np.float32)
y_np = (X_np @ w > 0).astype(np.int64)

X_train_t = torch.from_numpy(X_np[:800])
y_train_t = torch.from_numpy(y_np[:800])
X_val_t   = torch.from_numpy(X_np[800:])
y_val_t   = torch.from_numpy(y_np[800:])

train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=64, shuffle=True)
val_loader   = DataLoader(TensorDataset(X_val_t,   y_val_t),   batch_size=200)

# --- Model ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 2),
).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

# --- Training loop ---
for epoch in range(20):
    # Training
    model.train()
    for X_b, y_b in train_loader:
        X_b, y_b = X_b.to(device), y_b.to(device)
        optimizer.zero_grad()
        loss = criterion(model(X_b), y_b)
        loss.backward()
        optimizer.step()

    # Evaluation
    model.eval()
    with torch.no_grad():
        X_val_d = X_val_t.to(device)
        y_val_d = y_val_t.to(device)
        logits  = model(X_val_d)
        val_acc = (logits.argmax(1) == y_val_d).float().mean().item()

    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1:2d}  val_acc={val_acc:.4f}")

Key Takeaways

Tensors are views over flat storage described by strides. Transpose does not copy data; it changes strides. Non-contiguous tensors fail view() -- use reshape() or .contiguous().view().
Autograd builds a dynamic computation graph during the forward pass and traverses it in reverse during .backward(). Gradients accumulate: you must call optimizer.zero_grad() before every backward pass.
model.train() and model.eval() switch Dropout and BatchNorm behaviour. Forgetting them causes silent, hard-to-diagnose accuracy differences between training and evaluation.
Always pair model.eval() with torch.no_grad() at inference time. eval() changes behaviour; no_grad() stops the graph from being built, saving memory and time.
nn.Parameter for learnable tensors, register_buffer for non-learnable state that should be saved with the model (running stats, masks, positional embeddings).
Save checkpoints with state_dict not the full model object. Include optimizer state to resume training correctly. Use map_location when loading across devices.
Mixed precision (AMP) with autocast + GradScaler reduces VRAM usage, mainly for activations, and accelerates training on modern GPUs with only a few lines of code changed.
pin_memory=True + non_blocking=True in DataLoader/transfer pipeline enables CPU-GPU data transfer to overlap with GPU computation.

Practice Problems

Problem 1 -- Custom Layer: Implement a GatedLinearUnit (GLU) layer: given input $x$ of shape $(B, 2d)$, split it into two halves $a, b$ of shape $(B, d)$, and return $a \odot \sigma(b)$ where $\odot$ is element-wise multiplication and $\sigma$ is sigmoid. Register it as an nn.Module with learnable weight and bias. Verify that gradients flow through it correctly by checking that param.grad is non-None after a backward pass.

Problem 2 -- Gradient Norm Monitoring: Write a training loop that logs the L2 norm of all parameter gradients after each backward pass (before the optimizer step). Plot the gradient norms over 50 epochs for three models initialised differently: Xavier uniform, Kaiming normal, and all-ones. Observe how bad initialisation leads to vanishing or exploding gradients in the first few epochs.

Problem 3 -- Learning Rate Finder: Implement a learning rate range test (Smith 2015): increase the learning rate exponentially from lr_min=1e-6 to lr_max=10 over 100 mini-batches, record the loss at each step, and plot loss vs learning rate on a log scale. The optimal LR is approximately one decade below where the loss begins to diverge. This is the foundation of the torch-lr-finder library.

Problem 4 -- Custom Autograd Function: Implement the GELU activation function as a custom torch.autograd.Function with explicit forward and backward methods. Check your backward against numerical gradients with torch.autograd.gradcheck, and compare the outputs and gradients to PyTorch's built-in nn.GELU. GELU is defined as $x \Phi(x)$ where $\Phi$ is the standard normal CDF, approximated as $0.5 x (1 + \tanh(\sqrt{2/\pi} (x + 0.044715 x^3)))$.

Problem 5 -- Multi-GPU DataParallel: Wrap a model in nn.DataParallel (or nn.parallel.DistributedDataParallel if you have access to a multi-GPU machine). Benchmark training throughput (samples/second) on 1 GPU vs 2 GPUs. Document the scaling efficiency and the overhead sources (batch splitting, gradient reduction, parameter replication or synchronisation).
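As a starting reference for Problem 4, the two GELU formulas can be compared in plain Python before any PyTorch code is written. This is a sketch for sanity-checking numbers, not a solution to the exercise: the function names are hypothetical, and the derivative shown is for the exact form $x \Phi(x)$ only.

```python
import math

def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))   # x * Phi(x)

def gelu_tanh(x: float) -> float:
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_exact_grad(x: float) -> float:
    # d/dx [x * Phi(x)] = Phi(x) + x * phi(x), with phi the standard normal pdf
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return Phi + x * phi

def numeric_grad(f, x: float, h: float = 1e-5) -> float:
    """Central finite difference, the same idea gradcheck relies on."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

xs = [i / 10.0 for i in range(-40, 41)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
max_grad_err = max(abs(gelu_exact_grad(x) - numeric_grad(gelu_exact, x)) for x in xs)
print(max_gap < 1e-2, max_grad_err < 1e-6)   # True True
```

The values from `gelu_exact` and `gelu_exact_grad` give you numbers to compare your custom Function against at a few sample points.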
